flowchart LR
subgraph dt[nsc-exporter]
check-new-data(((New data?))) --> |Yes| transfer(Transfer to TSD)
transfer --> sleep(Sleep 10 minutes)
sleep --> check-new-data
check-new-data --> |No| sleep
end
subgraph dp[data producer]
lims-exporter[lims-exporter] ---> |produce| check-new-data
pipeline[NSC pipeline] ---> |produce| check-new-data
end
style dt fill:#e4eda6,stroke-width:3px
style dp fill:#aab0a2
1 Background
GDx at OUSAMG is planning to upscale WGS production to 4 x 48 samples or 2 x 48 + 1 x 96 samples per week (192 samples per week in either case). Do we have enough capacity in IT and the bioinformatics pipelines for this upscaling?
The capacity of IT and the bioinformatics pipelines can be evaluated from the following three aspects:
- Data transfer speed
- Data storage
- Pipeline capacity
2 Data transfer speed
Both the sequencing data and the NSC pipeline results are stored at the Norwegian Sequencing Center (NSC), so a large volume of data has to be transferred from NSC to TSD at UiO. The transfer is done by the nsc-exporter, which uses the TSD s3api, which in turn uses s3cmd under the hood. The nsc-exporter checks for new data every 10 minutes and transfers it with s3cmd put.
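The polling cycle described above can be sketched as follows. This is a minimal illustration, not the nsc-exporter's actual code: `run_exporter`, `find_new_files`, `upload` and `s3cmd_put` are hypothetical names, the destination URI is an assumption, and the real tool goes through the TSD s3api rather than calling s3cmd directly.

```python
import subprocess
import time
from pathlib import Path
from typing import Callable, Iterable, Optional

SLEEP_SECONDS = 10 * 60  # the 10-minute polling interval described in the text


def s3cmd_put(path: Path, destination: str) -> None:
    """Upload one file with `s3cmd put`; `destination` is a hypothetical s3:// URI."""
    subprocess.run(["s3cmd", "put", str(path), destination], check=True)


def run_exporter(
    find_new_files: Callable[[], Iterable[Path]],
    upload: Callable[[Path], None],
    sleep: Callable[[float], None] = time.sleep,
    max_cycles: Optional[int] = None,
) -> int:
    """Check for new data, transfer what was found, then sleep. Returns files sent."""
    sent = 0
    cycle = 0
    while max_cycles is None or cycle < max_cycles:
        for path in find_new_files():  # snapshot of the data that is new right now
            upload(path)
            sent += 1
        sleep(SLEEP_SECONDS)  # the exporter sleeps whether or not data was found
        cycle += 1
    return sent
```

Passing `find_new_files`, `upload` and `sleep` as arguments keeps the loop testable without touching the network or waiting for real sleeps.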
2.1 Data Collection
To evaluate the data transfer speed from NSC to TSD, we collected the historical data transfer records between 2023-09-01 08:41:40 and 2023-11-30 20:14:10 from the nsc-exporter log.
| datetime | project | filename | bytes | seconds | speed |
|---|---|---|---|---|---|
| 2023-11-09 04:14:25 | wgs329 | Diag-wgs329-HG20683365.HaplotypeCaller.bam | 957707 | 0.1 | 9400000 |
| 2023-09-02 21:39:54 | EKG230830 | HG12261133-PROSTATA-KIT-CuCaV3_S72_R2_001.fastq.gz | 532922810 | 6.1 | 83670000 |
| 2023-09-25 06:02:51 | wgs314 | Diag-wgs314-HG93823818C12309-DR.final.vcf | 1187800404 | 16.2 | 69940000 |
| 2023-11-13 12:01:32 | EKG231107 | Diag-EKG231107-HG35462927.sample | 1552 | 0 | 33540 |
| 2023-09-23 16:19:23 | EKG230920 | 230922_NDX551382_RUO_0014_AHMWKJAFX5.HG65895900-MAMMAE-KIT-CuCaV3_S45_R1_001.qc.pdf | 136425 | 0.1 | 2250000 |
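To make the record fields concrete, here is a small sketch that holds one log record and recomputes a throughput from its size and duration. `TransferRecord` and its field names are hypothetical, and the code assumes that `bytes` and `seconds` are the file size and the transfer duration. The recomputed value need not match the logged `speed` exactly, since the logged seconds are rounded and the log's unit convention is not stated.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TransferRecord:
    timestamp: datetime
    project: str
    filename: str
    size_bytes: int
    seconds: float

    def throughput(self) -> float:
        """Bytes per second from size and duration; 0.0 when the duration rounds to zero."""
        return self.size_bytes / self.seconds if self.seconds > 0 else 0.0


# One of the sample records shown above.
record = TransferRecord(
    timestamp=datetime.fromisoformat("2023-09-02 21:39:54"),
    project="EKG230830",
    filename="HG12261133-PROSTATA-KIT-CuCaV3_S72_R2_001.fastq.gz",
    size_bytes=532922810,
    seconds=6.1,
)
print(f"{record.throughput() / 1e6:.1f} MB/s")
```

Records with a logged duration of 0 seconds (very small files) are given a throughput of 0.0 here rather than raising a division error; that choice is an assumption of this sketch.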
2.2 Data Overview
The size of transferred files ranges from 0.0 B to 100.9 GiB. The average file size is 1.5 GiB. The median file size is 9.3 KiB. The standard deviation is 8.1 GiB.
| statistic | filesize |
|---|---|
| Min. | 0.0 B |
| 1st Qu. | 421.0 B |
| Median | 9.3 KiB |
| Mean | 1.5 GiB |
| 3rd Qu. | 968.0 KiB |
| Max. | 100.9 GiB |
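A summary like the one above can be reproduced with the standard library. The sketch below is an illustration under assumptions: `human_bytes` and `summary` are hypothetical helpers, binary units (1 KiB = 1024 B) are assumed from the unit labels, and the `inclusive` quantile method is chosen because it corresponds to the default used by R's `summary`. The input is just the five sample records from Section 2.1, not the full dataset.

```python
import statistics


def human_bytes(n: float) -> str:
    """Format a byte count with binary prefixes, e.g. 1552 -> '1.5 KiB'."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(n)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024


def summary(values):
    """Min, quartiles, mean and max of a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Min.": min(values),
        "1st Qu.": q1,
        "Median": median,
        "Mean": statistics.fmean(values),
        "3rd Qu.": q3,
        "Max.": max(values),
    }


sample_sizes = [957707, 532922810, 1187800404, 1552, 136425]
for name, value in summary(sample_sizes).items():
    print(f"{name:8s} {human_bytes(value)}")
```

A mean far above the median, as in the table, indicates a heavily right-skewed distribution: many tiny files and a few very large ones.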


The transfer speed ranges from 1.0 B/s to 93.1 MiB/s. The average transfer speed is 12.2 MiB/s. The median transfer speed is 286.7 KiB/s. The standard deviation is 23.4 MiB/s.
| statistic | speed (per second) |
|---|---|
| Min. | 1.0 B |
| 1st Qu. | 11.9 KiB |
| Median | 286.7 KiB |
| Mean | 12.2 MiB |
| 3rd Qu. | 8.4 MiB |
| Max. | 93.1 MiB |


The transfer time ranges from 0 seconds to 2084.4 seconds. The average transfer time is 19.5 seconds. The median transfer time is 0 seconds. The standard deviation is 104 seconds.
| statistic | seconds |
|---|---|
| Min. | 0.00 |
| 1st Qu. | 0.00 |
| Median | 0.00 |
| Mean | 19.47 |
| 3rd Qu. | 0.10 |
| Max. | 2084.40 |


2.3 Correlation
2.3.1 Transfer speed and time VS file size (all files)
Small files have a lower transfer speed. A good transfer speed of around 80 MB/s can be achieved for large files (>2 GB). However, the best speed is observed for files of around 200 MB (zoom in or see Figure 5).

2.3.2 Transfer speed and time VS file size (small files)
Although the transfer speed of small files is very low, their transfer time is usually very short, so small files are not the bottleneck of the data transfer. See also Section 2.5.1.


2.3.3 Maximum transfer speed reached around 200 MB file size?
Small files have a lower transfer speed and large files a higher one, but the best transfer speed appears to be reached for files of around 200 MB.
2.4 Idle Time
To evaluate whether there is capacity for upscaling, we need to know the idle time of the nsc-exporter. The nsc-exporter is idle when it is not transferring data.
All transfer records are plotted with the start time of each transfer on the x-axis and the time used to finish the transfer on the y-axis. The gaps represent idle periods of the nsc-exporter. The color represents the project, e.g. wgs123, EKG20230901. The shape represents the project type, e.g. wgs, EKG. You can hide a project by clicking it in the legend to the right of the figure.
For easier visualization, the data is grouped in months.
2.4.1 September

The total absolute time used for transferring files in September is 8 days, 20 hours, 35 minutes and 57.20 seconds. In total, 62.9 TB of data was transferred, including 8 new wgs projects.
The nsc-exporter sleeps for 10 minutes before checking for new data to transfer. In September, the nsc-exporter slept 173 times, 1 day, 4 hours and 50 minutes in total (see footnote 3).
Another time cost is the md5sum check performed by the s3cmd put command, which is not counted in the absolute transfer time (see footnote 4). The total time used for md5sum checking in September is 2 days, 6 hours, 35 minutes and 17.40 seconds.
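The md5sum estimate described in footnote 4 can be sketched as below. This is an assumed reading of the method, not the code used for the report: `estimate_md5_time` is a hypothetical name, each record is taken to be a `(start, transfer_seconds)` pair, and the logged timestamp is assumed to be the start of the transfer. Gaps longer than the 10-minute sleep interval are treated as idle time and skipped.

```python
from datetime import datetime, timedelta

SLEEP_INTERVAL = timedelta(minutes=10)


def estimate_md5_time(records):
    """Sum the short gaps between the end of one transfer and the start of the next.

    `records` is a list of (start, transfer_seconds) tuples sorted by start time.
    """
    total = timedelta()
    for (start, seconds), (next_start, _) in zip(records, records[1:]):
        gap = next_start - (start + timedelta(seconds=seconds))
        if timedelta() < gap <= SLEEP_INTERVAL:  # longer gaps count as idle, not md5sum
            total += gap
    return total


t0 = datetime(2023, 9, 1, 8, 41, 40)
example = [
    (t0, 60.0),                          # ends at t0 + 60 s
    (t0 + timedelta(seconds=90), 30.0),  # 30 s gap before it: counted
    (t0 + timedelta(minutes=20), 10.0),  # 18 min gap before it: skipped
]
print(estimate_md5_time(example))
```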
2.4.2 October

The total absolute time used for transferring files in October is 9 days, 16 hours, 5 minutes and 58.40 seconds. In total, 71.4 TB of data was transferred, including 10 new wgs projects.
The nsc-exporter sleeps for 10 minutes before checking for new data to transfer. In October, the nsc-exporter slept 319 times, 2 days, 5 hours and 10 minutes in total.
The total time used for md5sum checking in October is 2 days, 13 hours, 43 minutes and 11.30 seconds.
2.4.3 November

The total absolute time used for transferring files in November is 11 days, 12 hours, 49 minutes and 45 seconds. In total, 83.5 TB of data was transferred, including 12 new wgs projects.
The nsc-exporter sleeps for 10 minutes before checking for new data to transfer. In November, the nsc-exporter slept 180 times, 1 day and 6 hours in total.
The total time used for md5sum checking in November is 2 days, 23 hours, 4 minutes and 14.80 seconds.
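The monthly sleep totals follow directly from the number of counted sleeps multiplied by the 10-minute interval. The short check below reproduces them; only the variable names are introduced here, the counts are those reported above.

```python
from datetime import timedelta

SLEEP_INTERVAL = timedelta(minutes=10)

# Number of counted sleeps per month, as reported in Section 2.4.
sleep_counts = {"September": 173, "October": 319, "November": 180}

sleep_totals = {month: count * SLEEP_INTERVAL for month, count in sleep_counts.items()}
for month, total in sleep_totals.items():
    print(f"{month}: {total}")
```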
2.5 Discussion
2.5.1 Do we transfer too many small files?
Figure 15 shows the number of small files and the number of large files for different size thresholds. Figure 16 shows that the time used to transfer small files is negligible.
| | threshold | small files | large files |
|---|---|---|---|
| 1 | <100 kB | 1940 | 2595561 |
| 2 | <1 MB | 4158 | 2593343 |
| 3 | <10 MB | 6746 | 2590755 |
| 4 | <100 MB | 10596 | 2586905 |
| 5 | <1 GB | 38479 | 2559021 |
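Counting files on either side of a size threshold is straightforward. The sketch below shows one way to do it: `split_counts` is a hypothetical helper, decimal units (1 kB = 1000 B) are an assumption, and the input is only the five sample records from Section 2.1, so the counts do not correspond to the table.

```python
# Thresholds from the table, in bytes, assuming decimal units.
THRESHOLDS = {
    "<100 kB": 100e3,
    "<1 MB": 1e6,
    "<10 MB": 10e6,
    "<100 MB": 100e6,
    "<1 GB": 1e9,
}


def split_counts(sizes, threshold):
    """Return (files below threshold, files at or above threshold)."""
    small = sum(1 for size in sizes if size < threshold)
    return small, len(sizes) - small


sample_sizes = [957707, 532922810, 1187800404, 1552, 136425]
for label, threshold in THRESHOLDS.items():
    small, large = split_counts(sample_sizes, threshold)
    print(f"{label:8s} small={small} large={large}")
```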
2.5.2 Possibility Of One More 48-sample Run Per Week
The nsc-exporter is idle for a considerable portion of the time (see Section 2.4).
- September’s absolute transfer time is 8 days, 20 hours, 35 minutes and 57.20 seconds including 8 wgs projects.
- October’s absolute transfer time is 9 days, 16 hours, 5 minutes and 58.40 seconds including 10 wgs projects.
- November’s absolute transfer time is 11 days, 12 hours, 49 minutes and 45 seconds including 12 wgs projects.
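To put the monthly figures in proportion, the sketch below computes how much of each calendar month the nsc-exporter was occupied and the effective rate during transfers. It rests on several assumptions: transfer, md5sum and counted sleep time are all treated as busy time, full calendar months are used although the log starts and ends mid-day, and TB is taken as 10^12 bytes. `busy_fraction` and `effective_rate_mb_s` are hypothetical helpers.

```python
from datetime import timedelta

# Figures copied from Section 2.4.
MONTHS = {
    "September": {
        "calendar": timedelta(days=30),
        "transfer": timedelta(days=8, hours=20, minutes=35, seconds=57.2),
        "md5": timedelta(days=2, hours=6, minutes=35, seconds=17.4),
        "sleep": timedelta(days=1, hours=4, minutes=50),
        "terabytes": 62.9,
    },
    "October": {
        "calendar": timedelta(days=31),
        "transfer": timedelta(days=9, hours=16, minutes=5, seconds=58.4),
        "md5": timedelta(days=2, hours=13, minutes=43, seconds=11.3),
        "sleep": timedelta(days=2, hours=5, minutes=10),
        "terabytes": 71.4,
    },
    "November": {
        "calendar": timedelta(days=30),
        "transfer": timedelta(days=11, hours=12, minutes=49, seconds=45),
        "md5": timedelta(days=2, hours=23, minutes=4, seconds=14.8),
        "sleep": timedelta(days=1, hours=6),
        "terabytes": 83.5,
    },
}


def busy_fraction(month):
    """Share of the calendar month spent transferring, checksumming or in counted sleeps."""
    return (month["transfer"] + month["md5"] + month["sleep"]) / month["calendar"]


def effective_rate_mb_s(month):
    """Average rate in MB/s while transferring (TB assumed to be 10**12 bytes)."""
    return month["terabytes"] * 1e12 / month["transfer"].total_seconds() / 1e6


for name, month in MONTHS.items():
    print(f"{name}: busy {busy_fraction(month):.1%}, {effective_rate_mb_s(month):.1f} MB/s")
```

Whether counted sleeps should be included in busy time is a judgment call; dropping the `sleep` term gives a lower bound on utilization.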
The maximum transfer speed is reached at a file size of around 200 MB (Figure 5). This is the configured chunk size of s3cmd, the tool used by the nsc-exporter for data transfer, so increasing the chunk size might improve the transfer speed.
The current transfer speed is not optimal considering the 10 Gbps switch connecting NSC and TSD. We need to investigate the reason for the low transfer speed.
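The gap between the observed speed and the link capacity can be quantified with a quick calculation. The sketch assumes that the nominal 10 Gbps is fully available for payload, which ignores protocol overhead and other traffic; `link_utilisation` is a hypothetical helper.

```python
LINK_BITS_PER_SECOND = 10e9  # nominal capacity of the 10 Gbps switch


def link_utilisation(bytes_per_second, link_bps=LINK_BITS_PER_SECOND):
    """Fraction of the nominal link capacity used at a given transfer rate."""
    return bytes_per_second * 8 / link_bps


# About 80 MB/s is the typical rate reported for large files.
print(f"{link_utilisation(80e6):.1%}")  # prints 6.4%
```

Even the maximum observed speed of 93.1 MiB/s stays below a tenth of the nominal link capacity under this assumption.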
2.6 Conclusion
- We might be able to run 4 x 48 or 2 x 48 + 1 x 96 samples per week, but we would then be reaching the limit of the current setup.
- We need to investigate the reason for the low transfer speed.
- We need to investigate the possibility of increasing the chunk size of s3cmd to improve the transfer speed.
- We need to test if running 2 nsc-exporter processes in parallel can improve the transfer capacity.
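One way to approximate the parallel-transfer experiment from the last item is to run uploads through two workers and time the batch. This is a sketch inside a single process, not two separate nsc-exporter processes: `transfer_all` and `upload` are hypothetical names, and `upload` stands in for whatever performs a single transfer.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def transfer_all(files, upload, workers=2):
    """Run `upload` over `files` with a fixed number of workers; return elapsed seconds."""
    started = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for all uploads to finish and re-raises any exception.
        list(pool.map(upload, files))
    return time.monotonic() - started
```

Comparing the elapsed time for `workers=1` and `workers=2` on the same batch would show whether a second transfer stream adds capacity or merely shares the same bottleneck.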
3 Data storage
WGS produces a large amount of data, so data storage capacity is critical for the upscaling.
3.1 NSC
On the NSC side, the data is stored on boston at /boston/diag. Boston has a total capacity of 1.5 PB, and the usable capacity is 1.2 PB at the moment.
3.2 TSD
On the TSD side, the data is stored in /cluster/projects/p22. The total capacity is 1.8 PB, and the usable capacity is 1.2 PB at the moment.
4 Pipeline capacity (Illumina DRAGEN)
Illumina DRAGEN is a bioinformatics pipeline server that can be used to process WGS data. It takes around 1 hour to process a 30x WGS sample.
To be extended …
5 Discussion
To be added…
6 Conclusion
To be added…
Footnotes
1. The nsc-exporter log and sequencer overview html files are very small files and do not belong to any project. They are always transferred in a very short time and do not affect the transfer speed of other files. Therefore, they are ignored for simplicity.
2. Skipped wgs records where the md5sum check time is zero or close to zero, so that project types with small files/folders are shown.
3. Data arrives continuously. The nsc-exporter normally takes a snapshot of the new data and transfers it, then sleeps for 10 minutes before checking for new data again. The sleeps counted here are those where the next sleep starts more than one sleeping interval (10 minutes) later, signifying that new data arrived right after the sleep. This is in contrast to long idle periods, where no new data arrives after the sleep.
4. To make sure files are transferred intact, s3cmd put checks the md5sum of the files. This takes time and is not reported as transfer time by s3cmd. The md5sum check is done before the transfer starts. We estimate the md5sum check time by subtracting the transfer time from the time gap between two transfers. The md5sum check usually takes less than 10 minutes, so a gap larger than 10 minutes is not counted as md5sum check time.